Knowledge transfer between recurrent neural networks

ABSTRACT

Knowledge transfer between recurrent neural networks is performed by obtaining a first output sequence from a bidirectional Recurrent Neural Network (RNN) model for an input sequence, obtaining a second output sequence from a unidirectional RNN model for the input sequence, selecting at least one first output from the first output sequence based on a similarity between the at least one first output and a second output from the second output sequence; and training the unidirectional RNN model to increase the similarity between the at least one first output and the second output.

BACKGROUND

Technical Field

The present invention relates to knowledge transfer between Recurrent Neural Networks. More specifically, the present invention relates to transferring knowledge from a bidirectional Recurrent Neural Network to a unidirectional Recurrent Neural Network.

Description of the Related Art

A Recurrent Neural Network (RNN) is capable of generating an output sequence of variable size based on an input sequence of variable size. Because of this capability, an RNN is useful for speech recognition, machine translation, and any other applications that convert input sequences (e.g. audio sequence or original text) to output sequences (e.g. phoneme sequence or text translated into a target language).

A unidirectional RNN is an RNN that propagates an input sequence only in a forward direction. A unidirectional RNN can be used in an online system or real-time system because it requires only the data up to the current time. On the other hand, a bidirectional RNN is an RNN that propagates an input sequence in both forward and backward directions. A bidirectional RNN usually has better accuracy than a unidirectional RNN, but it has a longer latency since it can generate an output sequence only after an entire input sequence is received. Therefore, it is desired to develop a method of knowledge transfer (teacher-student modeling) that can improve the accuracy of a unidirectional RNN by training a student unidirectional RNN using a teacher bidirectional RNN.

SUMMARY

According to an embodiment of the present invention, a computer-implemented method is provided that includes: obtaining a first output sequence from a bidirectional Recurrent Neural Network (RNN) model for an input sequence; obtaining a second output sequence from a unidirectional RNN model for the input sequence; selecting at least one first output from the first output sequence based on a similarity between the at least one first output and a second output from the second output sequence; and training the unidirectional RNN model to increase the similarity between the at least one selected first output and the second output. In this way, the accuracy of the unidirectional RNN can be improved by using the at least one first output selected for the second output instead of using the first output from the bidirectional RNN at the same time index.

Selecting the at least one first output may include searching for the at least one first output within a predetermined range in the first output sequence, wherein the predetermined range is determined from an index of the second output in the second output sequence. In this way, computational cost can be reduced compared to searching for the at least one first output over the entire first output sequence.

Training the unidirectional RNN model may include training the unidirectional RNN model to increase the similarity between a first distribution of the at least one first output and a second distribution of the second output. In this way, training loss is reduced when transferring knowledge from the bidirectional RNN to the unidirectional RNN.

The first distribution may be a weighted sum of distributions of each of the at least one first output. In this way, the first distribution can be a better target of the second distribution.

According to another embodiment of the present invention, a computer program product is provided that includes one or more computer readable storage mediums collectively storing program instructions that are executable by a processor or programmable circuitry to cause the processor or programmable circuitry to perform operations including: obtaining a first output sequence from a bidirectional RNN (Recurrent Neural Network) model for an input sequence; obtaining a second output sequence from a unidirectional RNN model for the input sequence; selecting at least one first output from the first output sequence based on a similarity between the at least one first output and a second output from the second output sequence; and training the unidirectional RNN model to increase the similarity between the at least one first output and the second output.

According to another embodiment of the present invention, an apparatus is provided that includes a processor or a programmable circuitry, and one or more computer readable mediums collectively including instructions that, when executed by the processor or the programmable circuitry, cause the processor or the programmable circuitry to obtain a first output sequence from a bidirectional RNN (Recurrent Neural Network) model for an input sequence, obtain a second output sequence from a unidirectional RNN model for the input sequence, select at least one first output from the first output sequence based on a similarity between the at least one first output and a second output from the second output sequence, and train the unidirectional RNN model to increase the similarity between the at least one first output and the second output.

According to another embodiment of the present invention, a computer-implemented method is provided that includes obtaining a first output sequence from a bidirectional RNN (Recurrent Neural Network) model for an input sequence, obtaining a second output sequence from a unidirectional RNN model for the input sequence, selecting at least one first output from the first output sequence, wherein the at least one first output includes a first output that appears sequentially earlier in the first output sequence than a second output appears in the second output sequence, and training the unidirectional RNN model to increase the similarity between the at least one first output and the second output. Since the bidirectional RNN outputs a first output corresponding to a second output of a unidirectional RNN sequentially earlier, at least one appropriate output for the second output can be used to train the unidirectional RNN in this way.

According to another embodiment of the present invention, a computer program product is provided that includes one or more computer readable storage mediums collectively storing program instructions that are executable by a processor or programmable circuitry to cause the processor or programmable circuitry to perform operations including: obtaining a first output sequence from a bidirectional RNN (Recurrent Neural Network) model for an input sequence; obtaining a second output sequence from a unidirectional RNN model for the input sequence; selecting at least one first output from the first output sequence, wherein the at least one first output includes a first output that appears sequentially earlier in the first output sequence than a second output appears in the second output sequence; and training the unidirectional RNN model to increase the similarity between the at least one first output and the second output.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an example of a unidirectional RNN according to an embodiment of the present invention.

FIG. 2 shows an example of an unfolded unidirectional RNN according to an embodiment of the present invention.

FIG. 3 shows an example of an unfolded bidirectional RNN according to an embodiment of the present invention.

FIG. 4 shows example output probability spikes of a bidirectional LSTM CTC model and a unidirectional LSTM CTC model in end-to-end speech recognition according to an embodiment of the present invention.

FIG. 5 shows an apparatus 50 according to an embodiment of the present invention.

FIG. 6 shows an operational flow according to an embodiment of the present invention.

FIG. 7 shows an example of selecting at least one first output from a bidirectional RNN according to an embodiment of the present invention.

FIG. 8 shows an apparatus 80 according to an embodiment of the present invention.

FIG. 9 shows an apparatus 90 according to an embodiment of the present invention.

FIG. 10 shows an exemplary hardware configuration of a computer according to an embodiment of the present invention.

DETAILED DESCRIPTION

Hereinafter, example embodiments of the present invention will be described. The example embodiments shall not limit the invention according to the claims, and the combinations of the features described in the embodiments are not necessarily essential to the invention.

FIG. 1 shows an example of unidirectional RNN 10. Unidirectional RNN 10 inputs an input x_(i) and outputs an output y_(i) in each index (e.g. cycle or time step). For example, unidirectional RNN 10 inputs x₀ and outputs y₀ at time t=0, inputs x₁ and outputs y₁ at t=1, and so on. Therefore, unidirectional RNN 10 can convert an input sequence (x₀, x₁, . . . , x_(N−1)) to an output sequence (y₀, y₁, . . . , y_(N−1)), both having a variable length N. Each input x_(i) and each output y_(i) is a vector including one or more vector elements.

Unidirectional RNN 10 includes RNN layer 100 and softmax 110. RNN layer 100 is a neural network layer including a plurality of neurons and a plurality of links (synapses). In this figure, unidirectional RNN 10 has one neuron layer, meaning each link is connected from an input node to a neuron and from a neuron to an output node. However, unidirectional RNN 10 may have multiple recurrent or non-recurrent neuron layers, meaning the plurality of links would also include at least one recurrent link connected between neurons, where a recurrent link propagates information to the next index.

Softmax 110 receives an output of RNN layer 100, applies a softmax function to the output of RNN layer 100, and outputs the result as output y_(i). The softmax function is a function that converts an output vector of RNN layer 100 to a probability distribution including a probability for each vector element of the output vector of RNN layer 100. Unidirectional RNN 10 may adopt any other output function depending on the application of unidirectional RNN 10.
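As a non-limiting illustration, the softmax function described above can be sketched as follows. The three-element input vector is a hypothetical example and is not taken from any embodiment; subtracting the maximum value is a common numerical-stability measure that does not change the result.

```python
import math

def softmax(logits):
    """Convert a vector of real-valued scores into a probability distribution."""
    # Subtracting the maximum avoids overflow in exp() and leaves the result unchanged.
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output vector of an RNN layer at one index.
probs = softmax([2.0, 1.0, 0.1])
```

Each element of `probs` lies between 0 and 1, the elements sum to 1, and larger input values map to larger probabilities.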

In this example, unidirectional RNN 10 is based on a general RNN model. Instead of this implementation, unidirectional RNN 10 can be based on any type of RNN model such as LSTM (Long Short-Term Memory), Vanilla RNN, GRU (Gated Recurrent Unit) and so on.

FIG. 2 shows an example of a unidirectional RNN 10 unfolded along indices. Recurrent links in FIG. 1 that propagate information to the next index are described as links from RNN layer 100 to next RNN layer 100 along indices. Every unfolded RNN layer 100 has the same weight on each recurrent/non-recurrent link. As shown in this figure, information in RNN layer 100 is propagated toward RNN layer 100 at the next index. Therefore, unidirectional RNN 10 can determine output y_(i) based on preceding and current inputs (x₀, . . . , x_(i)), and succeeding inputs (x_(i+1), . . . , x_(N−1)) are not required for calculating output y_(i). Therefore, unidirectional RNN 10 is useful for on-line applications such as end-to-end speech recognition.

FIG. 3 shows an example of a bidirectional RNN 30. Bidirectional RNN 30 converts an input sequence (x₀, x₁, . . . , x_(N−1)) to an output sequence (y′₀, y′₁, . . . , y′_(N−1)). Bidirectional RNN 30 includes one or more RNN layers 300 and one or more softmaxes 310. Each RNN layer 300 is a neural network layer including a plurality of neurons and a plurality of links. In this figure, bidirectional RNN 30 has one neuron layer at each index; however, bidirectional RNN 30 may have multiple recurrent or non-recurrent neuron layers at each index. Each RNN layer 300 at index 0 to N−2 includes at least one recurrent link (forward link) connected from a neuron at the current index to a neuron at the next index. Each RNN layer 300 at index 1 to N−1 includes at least one recurrent link (backward link) connected from a neuron at the current index to a neuron at the previous index.

Softmax 310 at index i (i=0, . . . , N−1) receives an output of RNN layer 300 at index i, applies a softmax function to the output of RNN layer 300, and outputs the result as output y′_(i). Instead of using a softmax function, bidirectional RNN 30 may adopt any other output function depending on the application of bidirectional RNN 30.

In this example, bidirectional RNN 30 is based on a general RNN model. Instead of this implementation, bidirectional RNN 30 can be based on any type of RNN model such as LSTM (Long Short-Term Memory), Vanilla RNN, GRU (Gated Recurrent Unit) and so on.

As shown in this figure, information in RNN layer 300 is propagated toward RNN layer 300 at the next index and RNN layer 300 at the previous index. Therefore, bidirectional RNN 30 determines output y′_(i) based on preceding and current inputs (x₀, . . . , x_(i)), and also based on succeeding inputs (x_(i+1), . . . , x_(N−1)). Therefore, the whole input sequence must be obtained before bidirectional RNN 30 starts outputting a valid output sequence.
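As a non-limiting illustration of this difference in dependency, the following sketch uses a hypothetical single-neuron recurrence with a tanh activation and arbitrarily chosen weights (`w_in`, `w_rec`); none of these choices are taken from the embodiments. Altering only the last input leaves the earlier forward-only states unchanged, while the state of the bidirectional arrangement at index 0 changes.

```python
import math

def forward_states(xs, w_in=0.5, w_rec=0.9):
    """Left-to-right recurrence: the state at index i depends on x_0..x_i only."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_states(xs):
    """Pair each forward state with a backward state computed right-to-left."""
    fwd = forward_states(xs)
    bwd = forward_states(xs[::-1])[::-1]
    return list(zip(fwd, bwd))

xs = [1.0, -0.5, 0.3, 0.8]
changed = xs[:-1] + [-2.0]  # alter only the last input

uni_a, uni_b = forward_states(xs), forward_states(changed)
bi_a, bi_b = bidirectional_states(xs), bidirectional_states(changed)
```

Here `uni_a[:3]` equals `uni_b[:3]`, whereas `bi_a[0]` differs from `bi_b[0]` because its backward component has seen the altered input.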

FIG. 4 shows example output probability spikes of a bidirectional LSTM CTC model and a unidirectional LSTM CTC model in end-to-end speech recognition. In this application, unidirectional RNN 10 implemented as a unidirectional LSTM CTC and bidirectional RNN 30 implemented as a bidirectional LSTM CTC are trained by using training data including a plurality of pairs of an audio sequence (a speech sequence) as an input sequence and a corresponding phoneme sequence as an output sequence.

In this example, CTC (Connectionist Temporal Classification) training is used. In CTC training, if a phoneme sequence of “A B C” is pronounced in an audio sequence of length 4 (e.g., 4 indices or 4 time steps), all possible A -> B -> C sequences of length 4, which can include a blank “_”, are used as training output sequences. For example, “AABC”, “ABBC”, “ABCC”, “ABC_”, “AB_C”, “A_BC”, and “_ABC” are used as training output sequences. CTC training maximizes the summation of probabilities of possible phoneme sequences, and in doing so allows blank output for any time frame.
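As a non-limiting illustration, the training output sequences listed above can be reproduced by brute-force enumeration under the standard CTC collapsing rule (merge repeated symbols, then remove blanks). The function names are hypothetical, and the exhaustive enumeration is for explanation only; practical CTC training sums over these sequences with dynamic programming rather than listing them.

```python
from itertools import product

BLANK = "_"

def collapse(path):
    """Apply the CTC collapsing rule: merge repeated symbols, then drop blanks."""
    merged = []
    prev = None
    for sym in path:
        if sym != prev:
            merged.append(sym)
        prev = sym
    return "".join(s for s in merged if s != BLANK)

def ctc_alignments(target, length):
    """Enumerate every path of the given length that collapses to target."""
    alphabet = sorted(set(target)) + [BLANK]
    return sorted(
        "".join(path)
        for path in product(alphabet, repeat=length)
        if collapse(path) == target
    )

alignments = ctc_alignments("ABC", 4)
```

For the target “ABC” and length 4, `alignments` contains exactly the seven sequences named in the preceding paragraph.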

Waveform 400 in FIG. 4 shows a waveform of an audio sequence of a pronounced sentence “this is true”. This audio sequence is sampled at predetermined intervals such as 10 to 100 ms, and fed into trained unidirectional RNN 10 and bidirectional RNN 30. As shown in waveform 400, the phoneme sequence corresponding to the audio sequence of “this is true” is “DH IH S IH Z T R UW” where DH, IH, S, IH, Z, T, R, and UW are phoneme elements.

Output sequence 410 shows an output sequence of bidirectional RNN 30 corresponding to the audio sequence of “this is true.” In this example, each output of bidirectional RNN 30 and unidirectional RNN 10 is a probability distribution including a plurality of probabilities, each of which is a posterior probability of occurrence of a phoneme. In this figure, the output spike having the highest posterior probability in the probability distribution of each output of an output sequence is shown as a representation of that output. Output sequence 420 shows an output sequence of unidirectional RNN 10 corresponding to the audio sequence of “this is true.” Temporal (horizontal) axes of waveform 400, output sequence 410, and output sequence 420 in FIG. 4 are aligned.

A comparison of the output spikes of bidirectional RNN 30 and unidirectional RNN 10 shows that bidirectional RNN 30 and unidirectional RNN 10 output corresponding phoneme spikes at different indices or time steps. A comparison of waveform 400 and output sequence 410 shows that bidirectional RNN 30 outputs phoneme spikes at indices sequentially earlier than the indices at which the corresponding actual phoneme sounds are input. This is because each RNN layer 300 in bidirectional RNN 30 propagates information backwards, and thus each RNN layer 300 can output y′_(i) based on future information. On the other hand, a comparison of waveform 400 and output sequence 420 shows that unidirectional RNN 10 outputs phoneme spikes at indices sequentially later than the indices at which the corresponding actual phoneme sounds are input.

Therefore, conventional teacher-student modeling that uses an output sequence from a teacher model as a training output sequence for training a student model may not improve the accuracy of student unidirectional RNN 10 if bidirectional RNN 30 is used as a teacher model.

FIG. 5 shows an apparatus 50 according to an embodiment of the present invention. Apparatus 50 can improve the accuracy of a unidirectional RNN 10 by training student unidirectional RNN 10 using teacher bidirectional RNN 30. Apparatus 50 includes first model storage 500, first training section 510, second model storage 520, receiving section 530, first obtaining section 540, second obtaining section 550, selecting section 560, and second training section 580.

First model storage 500 stores a bidirectional RNN model, which is a model of bidirectional RNN 30. In the implementation shown in this figure, the bidirectional RNN model is a data structure representing bidirectional RNN 30. In other implementations, the bidirectional RNN model can be implemented as a hardware device or implemented in a programmable device. The reference: “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” by Alex Graves et al., Neural Networks, 18(5-6): pages 602-610, July 2005 shows an example of a network structure and trainable parameters of an LSTM (as an example of an RNN). The reference also shows calculations for training parameters of the LSTM. In one implementation, apparatus 50 can use the LSTM shown in the reference and adopt the calculations shown in the same. In other implementations, apparatus 50 may use another RNN and adopt calculations for training parameters that are suitable for that RNN.

First training section 510 is connected to first model storage 500, and first training section 510 trains the bidirectional RNN model. In this embodiment, first training section 510 receives training sequences including a plurality of pairs of a training input sequence and a training output sequence, and trains the bidirectional RNN model in first model storage 500 before training the unidirectional RNN model stored in second model storage 520. In one embodiment, the training output sequence for training the bidirectional RNN model includes a plurality of training outputs, each of which represents a selected output (e.g. a phoneme), instead of a probability distribution including a probability of occurrence for each output option. First training section 510 is also connected to first obtaining section 540, and first training section 510 sends a training input sequence to first obtaining section 540 and receives an output sequence of the current bidirectional RNN model stored in first model storage 500 from first obtaining section 540, which is described later in detail. First training section 510 updates parameters of the bidirectional RNN model to decrease the difference between the training output sequence and the output sequence calculated by first obtaining section 540. In one implementation, first training section 510 trains a bidirectional LSTM model by using Connectionist Temporal Classification (CTC) training. First training section 510 can adopt known training algorithms for updating parameters of the bidirectional RNN model.

Second model storage 520 stores a unidirectional RNN model, which is a model of unidirectional RNN 10. In the implementation shown in this figure, the unidirectional RNN model is a data structure representing unidirectional RNN 10. In other implementations, the unidirectional RNN model can be implemented as a hardware device or implemented in a programmable device.

Receiving section 530 receives an input sequence for training the unidirectional RNN model as a student by using the bidirectional RNN model as a teacher. For example, the input sequence is an audio sequence of a speech.

First obtaining section 540 is connected to receiving section 530 and receives the input sequence from receiving section 530. First obtaining section 540 is also connected to first model storage 500, and first obtaining section 540 obtains a first output sequence from the bidirectional RNN model in first model storage 500 for the input sequence by calculating the first output sequence based on the current bidirectional RNN model in first model storage 500 and the input sequence. For example, the first output sequence is a phoneme sequence corresponding to the input audio sequence of a speech. In this implementation, bidirectional RNN 30 is implemented by the data structure in first model storage 500 and a calculation part in first obtaining section 540.

Second obtaining section 550 is connected to receiving section 530 and receives the same input sequence from receiving section 530. Second obtaining section 550 is also connected to second model storage 520, and second obtaining section 550 obtains a second output sequence from a unidirectional RNN model in second model storage 520 for the input sequence by calculating the second output sequence based on the current unidirectional RNN model in second model storage 520 and the input sequence. For example, the second output sequence is a phoneme sequence corresponding to the input audio sequence of a speech. In this implementation, unidirectional RNN 10 is implemented by the data structure in second model storage 520 and the calculation part in second obtaining section 550.

Selecting section 560 is connected to first obtaining section 540 and second obtaining section 550. Selecting section 560 receives the first output sequence from first obtaining section 540 and the second output sequence from second obtaining section 550. As shown in FIG. 4, bidirectional RNN 30 outputs a vector element in the first output sequence sequentially earlier than unidirectional RNN 10 outputs a corresponding vector element in the second output sequence. Therefore, selecting section 560 selects, for each of one or more second outputs in the second output sequence, at least one first output from the first output sequence based on a similarity between the at least one first output and the second output from the second output sequence. By this feature, selecting section 560 can select at least one appropriate training output in the first output sequence for each second output in the second output sequence.

Selecting section 560 includes searching section 570. For each second output, searching section 570 searches for the at least one first output in the first output sequence.

Second training section 580 is connected to second model storage 520, receiving section 530, second obtaining section 550 and selecting section 560. Second training section 580 receives the input sequence from receiving section 530, the second output sequence from second obtaining section 550, and the at least one first output for each second output from selecting section 560. Second training section 580 trains the unidirectional RNN model to increase the similarity between the at least one first output and the second output for each second output in the second output sequence. In this embodiment, second training section 580 updates parameters of the unidirectional RNN model in second model storage 520 to decrease the difference between the at least one first output and the corresponding second output in the second output sequence calculated by second obtaining section 550.

FIG. 6 shows an operational flow according to an embodiment of the present invention. The operations of FIG. 6 can be performed by an apparatus, such as apparatus 50 and its components that were explained in reference to FIG. 5. While the operational flow of FIG. 6 will be explained in reference to apparatus 50 and its components, the operational flow can be performed by other apparatus having different components as well.

At S600, first training section 510 receives training sequences and trains the bidirectional RNN model in first model storage 500. In one implementation, first training section 510 adopts CTC training.

At S610, receiving section 530 receives an input sequence for training the unidirectional RNN model as a student. In other embodiments, receiving section 530 also receives a training output sequence for the input sequence. In this case, receiving section 530 forwards the pair of the input sequence and the training output sequence to first training section 510, and first training section 510 then trains the bidirectional RNN model in first model storage 500 based on this pair of input and output sequences.

At S620, first obtaining section 540 obtains a first output sequence from the bidirectional RNN model in first model storage 500 for the input sequence received in S610.

At S630, second obtaining section 550 obtains a second output sequence from a unidirectional RNN model in second model storage 520 for the input sequence received in S610.

At S640, selecting section 560 selects, for each of one or more second outputs in the second output sequence, at least one first output from the first output sequence based on a similarity between the at least one first output and the second output of interest. Selecting section 560 can also do this by selecting, for each of the one or more first outputs in the first output sequence, one or more second outputs from the second output sequence based on a similarity between the first output of interest and the one or more second outputs. Searching section 570 searches for the at least one first output in the first output sequence for each second output. In one implementation, searching section 570 can search for a predetermined number (e.g. one or more) of the first outputs having the highest similarity to the second output of interest. The selected first outputs are consecutive outputs in the first output sequence, but the selected first outputs can be non-consecutive outputs in other implementations.

In other implementations, selecting section 560 may select the at least one first output only for each second output of a part of the second output sequence.

At S650, second training section 580 trains the unidirectional RNN model by using the input sequence, the second output sequence, and the at least one selected first output for each second output to increase the similarity between the at least one selected first output and the corresponding second output. Second training section 580 can train the unidirectional RNN model with respect to each second output which has at least one corresponding first output.

At S660, apparatus 50 checks whether teacher-student training is completed for all input sequences received by receiving section 530. If there are some remaining input sequences, apparatus 50 repeats S610 to S650 for each of the remaining input sequences. In one embodiment, the operational flow of FIG. 6 is executed two or more times for each input sequence.

In this embodiment, apparatus 50 can select, for each second output, at least one first output which is similar to the second output. Therefore, apparatus 50 can improve the accuracy of unidirectional RNN 10 by using training outputs adjusted to unidirectional RNN 10 instead of just using the first output sequence from bidirectional RNN 30.

FIG. 7 shows an example of selecting at least one first output from a bidirectional RNN according to an embodiment of the present invention. While FIG. 7 will be explained in reference to apparatus 50 and its components, and the operational flow of FIG. 6, the example of FIG. 7 can be applied to other apparatus having different components and/or different operational flows as well.

At S620 and S630 in FIG. 6, the same input sequence is input to bidirectional RNN 30, which includes bidirectional RNN layers 300 and softmaxes 310, and to unidirectional RNN 10, which includes unidirectional RNN layers 100 and softmaxes 110. Bidirectional RNN 30 and unidirectional RNN 10 output the first output sequence and the second output sequence, respectively. At S640 in FIG. 6, selecting section 560 compares each second output with each first output (at the same or different indices), and selects at least one first output based on a similarity between the at least one first output and the second output. Selecting section 560 can adopt one or more of the following selection policies.

(1) In one implementation, selecting section 560 selects the at least one first output including an output that appears sequentially earlier in the first output sequence than the second output of interest appears in the second output sequence. Even though some of the selected first outputs appear sequentially the same or later in the first output sequence, a first output that appears sequentially earlier than the second output of interest may improve the accuracy of unidirectional RNN 10.

(2) In one implementation, searching section 570 searches for the at least one first output between the earliest output (e.g. output y′₀ in the first output sequence) and the first output at the time step that is the same as the second output of interest (e.g. same index). In other implementations, searching section 570 searches for the at least one first output within a predetermined range in the first output sequence. Selecting section 560 determines the predetermined range from an index of the second output of interest. Selecting section 560 can determine a range of fixed length (e.g. 10) relative to an index of the second output of interest (e.g. a range at fixed distance from the second output of interest). Based on this range, searching section 570 can search, for the second output y_(i) in the second output sequence, for the at least one first output among outputs y′_(i−13) to y′_(i−4) in the first output sequence as an example. In this implementation, selecting section 560 can search for the at least one first output within a limited range of the first output sequence, and thus the computational cost can be decreased. If the range is large enough to cover the difference of time steps for bidirectional RNN 30 and unidirectional RNN 10 to output the same phoneme spike, then this does not decrease the accuracy of unidirectional RNN 10.
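As a non-limiting illustration, the search within a predetermined range can be sketched as follows, using the example range y′_(i−13) to y′_(i−4). The function names, the toy distributions, the clipping of the range at the sequence boundaries, and the use of the lowest KL divergence as the selection criterion are assumptions made for this sketch only.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes every q[k] > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def search_window(first_seq, second_seq, i, offset_far=13, offset_near=4):
    """Return the index j in [i - offset_far, i - offset_near] whose first
    output has the lowest KL divergence from second output i, or None if
    the range lies entirely outside the first output sequence."""
    lo = max(0, i - offset_far)
    hi = min(len(first_seq) - 1, i - offset_near)
    if hi < lo:
        return None
    return min(range(lo, hi + 1),
               key=lambda j: kl_divergence(first_seq[j], second_seq[i]))

# Toy distributions over (phoneme A, phoneme B, blank); values are illustrative.
A, BLANK = [0.90, 0.05, 0.05], [0.05, 0.05, 0.90]
first_seq = [BLANK] * 20
first_seq[5] = A      # hypothetical teacher spike at index 5
second_seq = [BLANK] * 20
second_seq[12] = A    # hypothetical student spike at index 12

best = search_window(first_seq, second_seq, 12)
```

In this toy case only indices 0 to 8 of the first output sequence are examined for second output 12, and `best` is 5, the index of the matching spike.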

(3) In one implementation, selecting section 560 selects, for each second output from the second output sequence, at least one corresponding first output from the first output sequence, satisfying a condition that each second output appears in the same relative sequential order in the second output sequence as each of the at least one corresponding first output in the first output sequence. For example, selecting section 560 selects the at least one first output for the second output of interest at time step i in the second output sequence between the first output next to the first output selected for the second output at time step i−1 and the first output at index i. This means selecting section 560 selects the first output between y′_(i−9) and y′_(i−1) for second output y_(i) if first output y′_(i−10) is selected for second output y_(i−1). In this implementation, the order of phoneme spikes is maintained between the second outputs and the selected first outputs, and thus the assignment of the first outputs to the second outputs will be closer to the actual correspondence relationship. Selecting section 560 can also search for the at least one first output within the limited range of the first output sequence, and thus the computational cost can be decreased.
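As a non-limiting illustration, an order-preserving selection can be sketched as a greedy procedure. The following assumptions are made for this sketch only: one first output is selected per second output; selection is performed only for a chosen subset of second outputs (here, the hypothetical spike indices); the upper bound of each search is index i; and the lowest KL divergence is used as the selection criterion. All names and values are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes every q[k] > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def select_in_order(first_seq, second_seq, indices):
    """Greedy order-preserving selection.

    For each second-output index i in `indices` (ascending), choose the first
    output j with prev_j < j <= i that has the lowest KL divergence from
    second output i, where prev_j is the choice made for the previous index.
    """
    selected = {}
    prev_j = -1
    for i in sorted(indices):
        lo, hi = prev_j + 1, min(i, len(first_seq) - 1)
        if lo > hi:
            continue  # no admissible first output remains for this index
        j = min(range(lo, hi + 1),
                key=lambda k: kl_divergence(first_seq[k], second_seq[i]))
        selected[i] = j
        prev_j = j
    return selected

# Toy distributions over (phoneme A, phoneme B, blank); values are illustrative.
A, B, BLANK = [0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]
first_seq = [BLANK, A, BLANK, B, BLANK, BLANK]    # teacher spikes at 1 and 3
second_seq = [BLANK, BLANK, BLANK, A, BLANK, B]   # student spikes at 3 and 5
selection = select_in_order(first_seq, second_seq, indices=[3, 5])
```

In this toy case `selection` maps second output 3 to first output 1 and second output 5 to first output 3, so the order of the spikes is maintained.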

(4) In one implementation, bidirectional RNN 30 and unidirectional RNN 10 output a probability distribution as each first output and each second output. For example, each first output y′_(i) and each second output y_(i) includes a plurality of probabilities each of which is a probability of occurrence of a corresponding phoneme (e.g. probability of occurrence of “DH” is X, probability of occurrence of “IH” is Y, and so on). In this case, selecting section 560 selects the at least one first output for the second output based on a similarity between a first distribution of the at least one first output and a second distribution of the second output. Second training section 580 trains the unidirectional RNN model to increase the similarity between a first distribution of the at least one first output and a second distribution of the second output.

Selecting section 560 can calculate the similarity between the first distribution and the second distribution based on a Kullback-Leibler (KL) divergence between the first distribution and the second distribution. KL divergence takes a low value if the first distribution and the second distribution are similar, and a high value if they are different. Therefore, selecting section 560 can use a similarity measurement that becomes higher if KL divergence is lower and becomes lower if KL divergence is higher. In one example, selecting section 560 can use a similarity inversely proportional to KL divergence. In this case, second training section 580 trains the unidirectional RNN model to decrease the KL divergence between the first distribution and the second distribution.
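A minimal sketch of such a KL-based similarity is given below for illustration only. The direction of the divergence (first distribution relative to second distribution), the smoothing constant `eps`, and the mapping 1 / (1 + KL) are assumptions of this sketch. The mapping is not strictly inversely proportional to KL divergence, but it decreases as KL divergence increases and stays finite when the two distributions are identical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    eps is a hypothetical smoothing constant that avoids log(0).
    """
    total = sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )
    return max(0.0, total)


def similarity(p, q):
    """Assumed similarity measure: higher when KL divergence is lower."""
    return 1.0 / (1.0 + kl_divergence(p, q))


if __name__ == "__main__":
    first = [0.7, 0.2, 0.1]        # e.g. a teacher distribution over phonemes
    near = [0.6, 0.3, 0.1]         # a similar student distribution
    far = [0.1, 0.2, 0.7]          # a dissimilar student distribution
    print(similarity(first, near) > similarity(first, far))
```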

(5) In one implementation, selecting section 560 selects one first output for each second output. In other implementations, selecting section 560 selects two or more first outputs for each second output. Selecting section 560 can select two or more consecutive first outputs in the first output sequence. In an implementation, selecting section 560 selects a fixed number of first outputs. Selecting section 560 merges the selected first outputs to obtain a training output by calculating a weighted sum of the first outputs, averaging the two or more first outputs, or other operations. Selecting section 560 can determine weights of the weighted sum based on the similarity of each first output individually compared with the second output of interest. In this case, selecting section 560 assigns a higher weight to a first output that is more similar to the second output of interest. Selecting section 560 can assign, for each first output of the two or more first outputs, a weight that is proportional to the similarity between that first output and the second output of interest. In other implementations, selecting section 560 can assign each weight based on the similarity by using a different calculation. For example, selecting section 560 can assign a weight that is proportional to the similarity squared, cubed, and so on.

Searching section 570 searches for the two or more consecutive outputs in the first output sequence based on a similarity between the two or more consecutive outputs and the second output of interest. This similarity can be equal to a similarity between a representative output of the two or more consecutive outputs (e.g., a weighted sum of the two or more consecutive outputs) and the second output of interest. For example, searching section 570 searches for the two or more consecutive outputs having the representative output most similar to the second output of interest.
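The search for consecutive outputs can be illustrated by the following sketch, which slides a window of fixed width over a range of the first output sequence and keeps the window whose representative output is most similar to the second output of interest. The function names, the use of a plain average as the representative output, and the negative squared distance used as a stand-in similarity are all assumptions of this sketch; a KL-based similarity could be passed in instead.

```python
def mean_vector(vectors):
    """Element-wise average of equal-length vectors (the representative output)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def search_consecutive(first_seq, y, width, sim, lo=0, hi=None):
    """Return (start, score) of the best window of `width` consecutive outputs.

    Only windows lying entirely inside indices lo..hi are considered.
    Returns (None, -inf) if no window fits in the range.
    """
    hi = len(first_seq) - 1 if hi is None else hi
    best_start, best_score = None, float("-inf")
    for start in range(lo, hi - width + 2):
        representative = mean_vector(first_seq[start:start + width])
        score = sim(representative, y)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


def neg_sq_dist(a, b):
    """Stand-in similarity for the example: higher means closer."""
    return -sum((ai - bi) ** 2 for ai, bi in zip(a, b))


if __name__ == "__main__":
    teacher = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]
    student_output = [0.25, 0.75]
    print(search_consecutive(teacher, student_output, width=2, sim=neg_sq_dist))
```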

In cases where each first output is a probability distribution as described in (4), selecting section 560 calculates the first distribution as a weighted sum of distributions of each of the two or more first outputs, and selecting section 560 can determine weights of the weighted sum based on the similarity of the distribution of each of the two or more first outputs individually compared with the second output of interest. In this case, selecting section 560 assigns a higher weight to a first output having a distribution which is more similar to the distribution of the second output of interest. Therefore, the first distribution can be a better target for the second distribution than an averaged distribution of the first outputs.
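The similarity-weighted merging of (5) can be sketched as follows, under stated assumptions. The similarity 1 / (1 + KL), the function names, and the `power` parameter (1 for proportional weights, 2 for squared, 3 for cubed) are hypothetical choices for this illustration. The weights are normalized to sum to one so that the merged output remains a probability distribution, which is also an assumption of the sketch.

```python
import math


def kl_similarity(p, q, eps=1e-12):
    """Assumed similarity: 1 / (1 + KL(p || q)), always positive."""
    kl = sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )
    return 1.0 / (1.0 + max(0.0, kl))


def merge_by_similarity(first_outputs, y, power=1):
    """Merge selected first outputs into one training target for y.

    Each weight is proportional to similarity ** power, then normalized.
    Returns (merged_distribution, weights).
    """
    raw = [kl_similarity(p, y) ** power for p in first_outputs]
    total = sum(raw)
    weights = [r / total for r in raw]
    merged = [
        sum(w * p[k] for w, p in zip(weights, first_outputs))
        for k in range(len(y))
    ]
    return merged, weights


if __name__ == "__main__":
    selected = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
    student_output = [0.6, 0.3, 0.1]
    target, weights = merge_by_similarity(selected, student_output)
    print(weights[0] > weights[1])
```

Raising `power` concentrates the weight on the most similar first output, so the merged target approaches a single-output selection as `power` grows.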

(6) In one implementation, selecting section 560 selects at least one first output for each of all second outputs based on the similarity between the at least one first output and the second output of interest. In other implementations, selecting section 560 selects at least one first output for at least one second output based on a different selection rule. For example, if the same first output is selected for second outputs y_(i−1) and y_(i+1), then selecting section 560 can select the same first output for second output y_(i) of interest. In another example, if bidirectional RNN 30 and unidirectional RNN 10 have the same number of indices, then selecting section 560 may not select first outputs appearing sequentially earlier in the first output sequence than several earliest second outputs y₀ to y_(m) (m<N−1) appear in the second output sequence. In this case, selecting section 560 can select a first output (e.g. y′₀) sequentially earliest in the first output sequence for second outputs y₀ to y_(m).
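The two alternative selection rules of (6) can be illustrated with the short sketch below. It operates on a list of selected first-output indices, one per second output, in which `None` marks a second output that has no selection yet. The function names and this list representation are assumptions of the sketch.

```python
def fill_between_equal_neighbours(selected):
    """If y_(i-1) and y_(i+1) share the same first output, assign it to y_(i)."""
    filled = list(selected)
    for i in range(1, len(selected) - 1):
        left, right = selected[i - 1], selected[i + 1]
        if left is not None and left == right:
            filled[i] = left
    return filled


def assign_earliest(selected, m):
    """Assign the earliest first output (index 0) to second outputs y_0..y_m."""
    filled = list(selected)
    for i in range(min(m + 1, len(filled))):
        filled[i] = 0
    return filled


if __name__ == "__main__":
    print(fill_between_equal_neighbours([3, None, 3, 4, 7, 4]))
    print(assign_earliest([None, None, None, 1, 2], m=2))
```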

In other implementations, selecting section 560 may select at least one first output only for some of the second outputs.

In other implementations, the number of indices of bidirectional RNN 30 and unidirectional RNN 10 can be different. For example, bidirectional RNN 30 can have indices such as −2, −1, which are sequentially earlier than the earliest second output y₀ in the second sequence.

FIG. 8 shows an apparatus 80 according to an embodiment of the present invention. Apparatus 80 is a variation of apparatus 50 in FIG. 5. The same reference number is assigned to each component in apparatus 80 which has similar functions and configurations to the corresponding component in apparatus 50. Hereinafter, only the differences from apparatus 50 in FIG. 5 are explained, and any other explanation of apparatus 50 in FIG. 5 can also generally be applied to apparatus 80 in FIG. 8.

Selecting section 560 of apparatus 80 does not include searching section 570 shown in FIG. 5. Instead of searching for the at least one first output in the first output sequence for each second output, selecting section 560 selects, for each second output, at least one first output from the first output sequence, wherein the at least one first output includes a first output that appears sequentially earlier in the first output sequence than the second output appears in the second sequence.

In this embodiment, selecting section 560 selects at least one first output within a range of a fixed length relative to the index of the second output. For example, selecting section 560 selects, for second output y_(i) of interest, a fixed number (e.g. 5) of consecutive first outputs (e.g. y′_(i−4) to y′_(i)) backward from the first output at the same index as the second output y_(i) of interest. In other words, selecting section 560 sets an index window at a fixed relative location from the index of the second output of interest, and selects all first outputs in the index window for the second output of interest.

Selecting section 560 merges the selected first outputs to obtain a training output by calculating a weighted sum of the first outputs, averaging the selected first outputs, and so on as shown in (5) explained in reference to FIG. 6. In this embodiment, apparatus 80 can obtain a training output for each second output of interest without searching the first output sequence, and thus apparatus 80 can reduce computational cost.
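By way of example only, the fixed-window selection of apparatus 80 with plain averaging can be sketched as follows. The function name is hypothetical, and the clipping of the window at the start of the sequence (so that early second outputs use fewer than `width` first outputs) is an assumption of this sketch.

```python
def fixed_window_target(first_seq, i, width=5):
    """Average the first outputs y'_(i-width+1) .. y'_(i) into one training output.

    The window is clipped at index 0, so no search over the sequence is needed.
    """
    start = max(0, i - width + 1)
    window = first_seq[start:i + 1]
    n = len(window)
    return [sum(col) / n for col in zip(*window)]


if __name__ == "__main__":
    teacher = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
    # Training output for second output y_3 using first outputs y'_2 and y'_3.
    print(fixed_window_target(teacher, i=3, width=2))
```

Because the window location depends only on the index i, the cost per second output is constant in the length of the sequence.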

FIG. 9 shows an apparatus 90 according to an embodiment of the present invention. Apparatus 90 is a variation of apparatus 50 in FIG. 5. The same reference number is assigned to each component in apparatus 90 which has similar functions and configurations to the corresponding component in apparatus 50. Hereinafter, only the differences from apparatus 50 in FIG. 5 are explained, and any other explanation of apparatus 50 in FIG. 5 can also generally be applied to apparatus 90 in FIG. 9.

Apparatus 90 includes second training section 980 instead of second training section 580 in apparatus 50, and the other components are substantially the same as in apparatus 50. Second training section 980 has substantially the same functions and configurations as second training section 580, and additionally has a function and configuration for training the unidirectional RNN model in second model storage 520 by supervised training using one or more pairs of a training input sequence and a training output sequence.

Second training section 980 receives one or more training sequences each including a pair of a training input sequence and a training output sequence. Second training section 980 sends the training input sequence to second obtaining section 550 and receives an output sequence of the current unidirectional RNN model stored in second model storage 520. Second training section 980 updates parameters of the unidirectional RNN model to decrease the difference between the training output sequence and the output sequence calculated by second obtaining section 550. In one implementation, second training section 980 trains the unidirectional LSTM model by using Connectionist Temporal Classification (CTC) training. Second training section 980 can adopt a known training algorithm for updating parameters of the unidirectional RNN model.

In this embodiment, apparatus 90 alternately repeats supervised training using one or more training sequences received by second training section 980 and teacher-student training using one or more input sequences received by receiving section 530. Alternatively, apparatus 90 can first train unidirectional RNN 10 by supervised training and, after this training, train unidirectional RNN 10 by teacher-student training.
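The alternating schedule can be pictured with the following sketch, which interleaves the two kinds of training one batch at a time. The function name, the batch-level granularity of the alternation, and the two step callbacks are hypothetical; the embodiment does not specify how often the two kinds of training are switched.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking an exhausted batch stream


def alternate_training(supervised_batches, distill_batches,
                       supervised_step, distill_step):
    """Alternate one supervised step and one teacher-student step.

    supervised_step and distill_step are caller-supplied callbacks that
    update the unidirectional model. Returns the order of executed steps.
    """
    order = []
    pairs = zip_longest(supervised_batches, distill_batches, fillvalue=_MISSING)
    for sup, dis in pairs:
        if sup is not _MISSING:
            supervised_step(sup)
            order.append("supervised")
        if dis is not _MISSING:
            distill_step(dis)
            order.append("teacher-student")
    return order


if __name__ == "__main__":
    calls = []
    print(alternate_training(
        ["s0", "s1"], ["d0", "d1", "d2"],
        lambda batch: calls.append(("supervised", batch)),
        lambda batch: calls.append(("teacher-student", batch)),
    ))
```

The sequential variant, supervised training first and teacher-student training afterward, corresponds to running the two batch streams one after the other instead of interleaving them.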

In this embodiment, apparatus 90 can improve the accuracy of unidirectional RNN 10 by combining supervised training and teacher-student training.

Various embodiments of the present invention may be described with reference to flowcharts and block diagrams whose blocks may represent (1) steps of processes in which operations are performed or (2) sections of apparatuses responsible for performing operations. Certain steps and sections may be implemented by dedicated circuitry, programmable circuitry supplied with computer-readable instructions stored on computer-readable media, and/or processors supplied with computer-readable instructions stored on computer-readable media. Dedicated circuitry may include digital and/or analog hardware circuits and may include integrated circuits (IC) and/or discrete circuits. Programmable circuitry may include reconfigurable hardware circuits including logical AND, OR, XOR, NAND, NOR, and other logical operations, flip-flops, registers, memory elements, etc., such as field-programmable gate arrays (FPGA), programmable logic arrays (PLA), etc.

Computer-readable media may include any tangible device that can store instructions for execution by a suitable device, such that the computer-readable medium having instructions stored therein comprises an article of manufacture including instructions which can be executed to create means for performing operations specified in the flowcharts or block diagrams. Examples of computer-readable media may include an electronic storage medium, a magnetic storage medium, an optical storage medium, an electromagnetic storage medium, a semiconductor storage medium, etc. More specific examples of computer-readable media may include a floppy disk, a diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an electrically erasable programmable read-only memory (EEPROM), a static random access memory (SRAM), a compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a BLU-RAY(RTM) disc, a memory stick, an integrated circuit card, etc.

Computer-readable instructions may include assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, JAVA(RTM), C++, etc., and conventional procedural programming languages, such as the “C” programming language or similar programming languages.

Computer-readable instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, or to programmable circuitry, locally or via a local area network (LAN), wide area network (WAN) such as the Internet, etc., to execute the computer-readable instructions to create means for performing operations specified in the flowcharts or block diagrams. Examples of processors include computer processors, processing units, microprocessors, digital signal processors, controllers, microcontrollers, etc.

FIG. 10 shows an example of a computer 1200 in which aspects of the present invention may be wholly or partly embodied. A program that is installed in the computer 1200 can cause the computer 1200 to function as or perform operations associated with apparatuses of the embodiments of the present invention or one or more sections thereof, and/or cause the computer 1200 to perform processes of the embodiments of the present invention or steps thereof. Such a program may be executed by the CPU 1212 to cause the computer 1200 to perform certain operations associated with some or all of the blocks of flowcharts and block diagrams described herein.

The computer 1200 according to the present embodiment includes a CPU 1212, a RAM 1214, a graphics controller 1216, and a display device 1218, which are mutually connected by a host controller 1210. The computer 1200 also includes input/output units such as a communication interface 1222, a hard disk drive 1224, a DVD-ROM drive 1226 and an IC card drive, which are connected to the host controller 1210 via an input/output controller 1220. The computer also includes legacy input/output units such as a ROM 1230 and a keyboard 1242, which are connected to the input/output controller 1220 through an input/output chip 1240.

The CPU 1212 operates according to programs stored in the ROM 1230 and the RAM 1214, thereby controlling each unit. The graphics controller 1216 obtains image data generated by the CPU 1212 on a frame buffer or the like provided in the RAM 1214 or in itself, and causes the image data to be displayed on the display device 1218.

The communication interface 1222 communicates with other electronic devices via a network. The hard disk drive 1224 stores programs and data used by the CPU 1212 within the computer 1200. The DVD-ROM drive 1226 reads the programs or the data from the DVD-ROM 1201, and provides the hard disk drive 1224 with the programs or the data via the RAM 1214. The IC card drive reads programs and data from an IC card, and/or writes programs and data into the IC card.

The ROM 1230 stores therein a boot program or the like executed by the computer 1200 at the time of activation, and/or a program depending on the hardware of the computer 1200. The input/output chip 1240 may also connect various input/output units via a parallel port, a serial port, a keyboard port, a mouse port, and the like to the input/output controller 1220.

A program is provided by computer readable media such as the DVD-ROM 1201 or the IC card. The program is read from the computer readable media, installed into the hard disk drive 1224, RAM 1214, or ROM 1230, which are also examples of computer readable media, and executed by the CPU 1212. The information processing described in these programs is read into the computer 1200, resulting in cooperation between a program and the above-mentioned various types of hardware resources. An apparatus or method may be constituted by realizing the operation or processing of information in accordance with the usage of the computer 1200.

For example, when communication is performed between the computer 1200 and an external device, the CPU 1212 may execute a communication program loaded onto the RAM 1214 to instruct communication processing to the communication interface 1222, based on the processing described in the communication program. The communication interface 1222, under control of the CPU 1212, reads transmission data stored on a transmission buffering region provided in a recording medium such as the RAM 1214, the hard disk drive 1224, the DVD-ROM 1201, or the IC card, and transmits the read transmission data to a network or writes reception data received from a network to a reception buffering region or the like provided on the recording medium.

In addition, the CPU 1212 may cause all or a necessary portion of a file or a database to be read into the RAM 1214, the file or the database having been stored in an external recording medium such as the hard disk drive 1224, the DVD-ROM drive 1226 (DVD-ROM 1201), the IC card, etc., and perform various types of processing on the data on the RAM 1214. The CPU 1212 may then write back the processed data to the external recording medium.

Various types of information, such as various types of programs, data, tables, and databases, may be stored in the recording medium to undergo information processing. The CPU 1212 may perform various types of processing on the data read from the RAM 1214, which includes various types of operations, processing of information, condition judging, conditional branch, unconditional branch, search/replace of information, etc., as described throughout this disclosure and designated by an instruction sequence of programs, and write the result back to the RAM 1214. In addition, the CPU 1212 may search for information in a file, a database, etc., in the recording medium. For example, when a plurality of entries, each having an attribute value of a first attribute associated with an attribute value of a second attribute, are stored in the recording medium, the CPU 1212 may search for an entry matching the condition whose attribute value of the first attribute is designated, from among the plurality of entries, and read the attribute value of the second attribute stored in the entry, thereby obtaining the attribute value of the second attribute associated with the first attribute satisfying the predetermined condition.

The above-explained program or software modules may be stored in the computer readable media on or near the computer 1200. In addition, a recording medium such as a hard disk or a RAM provided in a server system connected to a dedicated communication network or the Internet can be used as the computer readable media, thereby providing the program to the computer 1200 via the network.

In an embodiment of the present invention, one or more computers 1200 (shown in FIG. 10) may be configured to embody a bidirectional RNN model and a unidirectional RNN model. For example, a first model storage 500 (shown in FIG. 5) and a second model storage 520 may be provided in one or more hard disk drives 1224. A first training section 510 and a second training section 580 may be implemented by one or more processors, such as CPU 1212 executing program code stored on RAM 1214, ROM 1230 or the one or more hard disk drives 1224, collectively referred to as memory. Alternatively, the one or more processors (CPU 1212) may be dedicated neural network processors, machine learning processors or appropriately configured FPGA devices. Additionally, the selecting section 560 may be implemented by the one or more processors 1212 executing program code held in memory. Also, the one or more processors 1212 may execute program code held in memory to perform as a first obtaining section 540 and a second obtaining section 550. The receiving section 530 may be embodied in the I/O chip 1240 coupled to, for example, a microphone, and/or a communication interface 1222 coupled to a network.

The present embodiment may be used to implement a speech recognition system, for example. The present embodiment, implemented as a speech recognition system, may be configured to generate, by way of the first training section 510, a first phoneme sequence as a first output sequence corresponding to an input training sequence (e.g., input audio sequence of speech) received by the receiving section 530. The first phoneme sequence is generated by the bidirectional RNN model stored in the first model storage 500. A second phoneme sequence is generated, by way of the second training section 580, as a second output sequence corresponding to the input training sequence received by the receiving section 530. The second phoneme sequence is generated by the unidirectional RNN model stored in the second model storage 520.

The present embodiment may also select at least one first phoneme from the first phoneme sequence based on a similarity between the at least one first phoneme and a second phoneme from the second phoneme sequence. The at least one first phoneme generated by the bidirectional RNN model is used to train the unidirectional RNN model to increase the similarity between the at least one first phoneme and the second phoneme.

The speech recognition system of the present embodiment may interpret a meaning of new second phoneme sequences generated by the trained unidirectional RNN model from input audio sequences received via the receiving section 530, for example, from the microphone coupled to the I/O chip 1240. Based on the interpreted meaning of the new second phoneme sequences, the present embodiment controls an action performed by the device (e.g., the computer 1200). Actions of the device that may be controlled by the present embodiment may include activating an application, such as a music player (for example, when a phoneme sequence is interpreted as “play music”), transcribing the phoneme sequence into text, as in a speech-to-text transcription system, or translating the phoneme sequence from the native language into a target language, as in a speech translator system.

Additionally, the present embodiment may be configured to control external devices, for example smart home automation devices. For example, when the present embodiment interprets a new second phoneme sequence as “turn off the lights”, the computer 1200 may control one or more smart light switches or smart light bulbs by sending a control signal to the lighting devices by way of the communication interface 1222. In addition to controlling lighting devices, the present embodiment may be configured to interact with smart thermostats, smart appliances and other Internet of things enabled devices.

While the embodiments of the present invention have been described, the technical scope of the invention is not limited to the above described embodiments. It will be apparent to persons skilled in the art that various alterations and improvements can be added to the above-described embodiments. It should also be apparent from the scope of the claims that the embodiments added with such alterations or improvements are within the technical scope of the invention.

The operations, procedures, steps, and stages of each process performed by an apparatus, system, program, and method shown in the claims, embodiments, or diagrams can be performed in any order as long as the order is not indicated by “prior to,” “before,” or the like and as long as the output from a previous process is not used in a later process. Even if the process flow is described using phrases such as “first” or “next” in the claims, embodiments, or diagrams, it does not necessarily mean that the process must be performed in this order.

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

What is claimed is:
 1. A computer-implemented method comprising: obtaining a first output sequence from a bidirectional Recurrent Neural Network (RNN) model for an input sequence; obtaining a second output sequence from a unidirectional RNN model for the input sequence; selecting at least one first output from the first output sequence based on a similarity between the at least one first output and a second output from the second output sequence; and training the unidirectional RNN model to increase the similarity between the at least one first output and the second output.
 2. The computer-implemented method of claim 1, wherein the at least one first output includes an output that appears sequentially earlier in the first output sequence than the second output appears in the second output sequence.
 3. The computer-implemented method of claim 1, wherein selecting the at least one first output includes searching for the at least one first output within a predetermined range in the first output sequence, wherein the predetermined range is determined from an index of the second output in the second output sequence.
 4. The computer-implemented method of claim 1, wherein selecting the at least one first output includes selecting, for each second output from the second output sequence, at least one first corresponding output from the first output sequence, wherein each second output appears in a same relative sequential order in the second output sequence as each of the at least one corresponding first output in the first output sequence.
 5. The computer-implemented method of claim 1, wherein training the unidirectional RNN model includes training the unidirectional RNN model to increase the similarity between a first distribution of the at least one first output and a second distribution of the second output.
 6. The computer-implemented method of claim 5, wherein training the unidirectional RNN model includes training the unidirectional RNN model to decrease Kullback-Leibler (KL) divergence between the first distribution and the second distribution.
 7. The computer-implemented method of claim 5, wherein the first distribution is a weighted sum of distributions of each of the at least one first output.
 8. The computer-implemented method of claim 7, wherein selecting the at least one first output includes determining weights of the weighted sum of distributions of each of the at least one first output based on similarity of distribution of each of the at least one first output individually compared with the second output.
 9. The computer-implemented method of claim 1, further comprising training the bidirectional RNN model before training the unidirectional RNN model.
 10. The computer-implemented method of claim 1, wherein the bidirectional RNN model is a bidirectional Long Short-Term Memory (LSTM) and the unidirectional RNN model is a unidirectional LSTM.
 11. The computer-implemented method of claim 10, further comprising training the unidirectional LSTM model by using Connectionist Temporal Classification (CTC) training.
 12. The computer-implemented method of claim 1, wherein the input sequence is an audio sequence of a speech, and the first output sequence and the second output sequence are phoneme sequences.
 13. A computer program product including one or more computer readable storage mediums collectively storing program instructions that are executable by a processor or programmable circuitry to cause the processor or programmable circuitry to perform operations comprising: obtaining a first output sequence from a bidirectional RNN (Recurrent Neural Network) model for an input sequence; obtaining a second output sequence from a unidirectional RNN model for the input sequence; selecting at least one first output from the first output sequence based on a similarity between the at least one first output and a second output from the second output sequence; and training the unidirectional RNN model to increase the similarity between the at least one first output and the second output.
 14. The computer program product of claim 13, wherein selecting the at least one first output includes searching for the at least one first output within a predetermined range in the first output sequence, wherein the predetermined range is determined from an index of the second output in the second output sequence.
 15. The computer program product of claim 13, wherein training the unidirectional RNN model includes training the unidirectional RNN model to increase the similarity between a first distribution of the at least one first output and a second distribution of the second output.
 16. The computer program product of claim 13, wherein the first distribution is a weighted sum of distributions of each of the at least one first output.
 17. An apparatus comprising: a processor or a programmable circuitry; and one or more computer readable mediums collectively including instructions that, when executed by the processor or the programmable circuitry, cause the processor or the programmable circuitry to: obtain a first output sequence from a bidirectional RNN (Recurrent Neural Network) model for an input sequence; obtain a second output sequence from a unidirectional RNN model for the input sequence; select at least one first output from the first output sequence based on a similarity between the at least one first output and a second output from the second output sequence; and train the unidirectional RNN model to increase the similarity between the at least one first output and the second output.
 18. The apparatus of claim 17, wherein the instructions that cause the processor or the programmable circuitry to select the at least one first output include instructions that cause the processor or the programmable circuitry to search for the at least one first output within a predetermined range in the first output sequence, wherein the predetermined range is determined from an index of the second output in the second output sequence.
 19. The apparatus of claim 17, wherein instructions that cause the processor or the programmable circuitry to train the unidirectional RNN model include instructions that cause the processor or the programmable circuitry to train the unidirectional RNN model to increase the similarity between a first distribution of the at least one first output and a second distribution of the second output.
 20. The apparatus of claim 17, wherein the first distribution is a weighted sum of distributions of each of the at least one first output.
 21. A computer-implemented method comprising: obtaining a first output sequence from a bidirectional RNN (Recurrent Neural Network) model for an input sequence; obtaining a second output sequence from a unidirectional RNN model for the input sequence; selecting at least one first output from the first output sequence, where the at least one first output includes a first output that appears sequentially earlier in the first output sequence than a second output appears in the second sequence; and training the unidirectional RNN model to increase the similarity between the at least one first output and the second output.
 22. The computer-implemented method of claim 21, wherein training the unidirectional RNN model includes training the unidirectional RNN model to increase the similarity between a first distribution of the at least one first output and a second distribution of the second output.
 23. The computer-implemented method of claim 22, wherein the first distribution is a weighted sum of distributions of each of the at least one first output.
 24. A computer program product including one or more computer readable storage mediums collectively storing program instructions that are executable by a processor or programmable circuitry to cause the processor or programmable circuitry to perform operations comprising: obtaining a first output sequence from a bidirectional RNN (Recurrent Neural Network) model for an input sequence; obtaining a second output sequence from a unidirectional RNN model for the input sequence; selecting at least one first output from the first output sequence, where the at least one first output includes a first output that appears sequentially earlier in the first output sequence than a second output appears in the second sequence; and training the unidirectional RNN model to increase the similarity between the at least one first output and the second output.
 25. The computer program product of claim 13, wherein training the unidirectional RNN model includes training the unidirectional RNN model to increase the similarity between a first distribution of the at least one first output and a second distribution of the second output.